Predictions for COVID-19 in India

Motivation

Recently, India has experienced a sharp increase in COVID-19 cases and deaths. With insufficient medical resources and poor living conditions, hospitals are unfortunately forced to turn many people away in favor of patients with more severe cases. In this notebook, we will attempt to predict the number of deaths in India over the next 3 months using machine learning techniques. This information can help hospitals and the government better estimate the money and resources that need to be allocated in their respective areas.

By Irfaan Jamarussadiq

Data Sources

Part 1: Exploring the Total Counts for India

We will start by importing some of the libraries we will need. The requests library is used to fetch the .json file from the internet and parse its contents as JSON. The json library is then used to save that data to a local JSON file and convert it into a dictionary. Finally, we use pandas to convert the dictionary into a dataframe so that we can more easily plot and visualize the data.

Because the data is stored in a specific JSON format, we need to fetch the JSON file from the web and parse it into a Python data structure. We will start by acquiring the all_totals.json file, which contains the totals for the number of active cases, deaths, people cured, and confirmed cases, each with an associated timestamp.
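A minimal sketch of this acquire-and-cache step. The URL here is a placeholder (the real source is not shown in this notebook), and the local copy doubles as a cache so re-running the notebook does not re-download the file:

```python
import json
import os

# Placeholder URL -- substitute the actual location of all_totals.json.
ALL_TOTALS_URL = "https://example.com/all_totals.json"
CACHE_PATH = "all_totals.json"

def load_totals(url: str = ALL_TOTALS_URL, cache: str = CACHE_PATH) -> dict:
    """Return the totals data as a dictionary, preferring a cached local copy."""
    if os.path.exists(cache):
        with open(cache) as f:
            return json.load(f)
    import requests
    response = requests.get(url)
    response.raise_for_status()
    data = response.json()
    # Save a local copy for reproducibility.
    with open(cache, "w") as f:
        json.dump(data, f)
    return data
```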

As you can see, the data is stored in a key-value pair format, where the key contains the timestamp and attribute name, while the value contains the number associated with that attribute. We can wrangle this data format into a more table-like structure so that we can convert this dictionary into a dataframe.
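The wrangling step can be sketched as follows. The exact key layout of all_totals.json is not reproduced here, so this toy dictionary assumes a hypothetical `"<timestamp>|<attribute>"` key format; the grouping logic is the same regardless of the separator:

```python
import pandas as pd

# Toy data mimicking the described key-value layout (separator assumed).
raw = {
    "2021-04-01|totalconfirmed": 12300000,
    "2021-04-01|totaldeceased": 163000,
    "2021-04-02|totalconfirmed": 12400000,
    "2021-04-02|totaldeceased": 164000,
}

# Group values by timestamp so each date becomes one row of the table.
records = {}
for key, value in raw.items():
    timestamp, attribute = key.split("|")
    records.setdefault(timestamp, {})[attribute] = value

df = pd.DataFrame.from_dict(records, orient="index")
df.index = pd.to_datetime(df.index)
print(df)
```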

There is a problem here with the date ranges, and that is that they are not continuous! If we ever want to visualize the data, it will be important to have a continuous date range. We can do this by adding new rows for the missing days, and simply taking the previous row's values as the values for the new rows. We fill in the missing values in this way because all of the metrics in the dataset are cumulative.
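The gap-filling step can be done with `reindex` and a forward fill. Because every metric is a running total, carrying the previous day's value forward is the correct imputation (the toy series below stands in for the real data):

```python
import pandas as pd

# Cumulative series with a missing day (2021-04-02).
df = pd.DataFrame(
    {"totaldeceased": [163000, 165000]},
    index=pd.to_datetime(["2021-04-01", "2021-04-03"]),
)

# Build a continuous daily index, then copy each missing day's values
# from the previous row.
full_range = pd.date_range(df.index.min(), df.index.max(), freq="D")
df = df.reindex(full_range).ffill()
print(df)
```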

Now that we have the data in a manageable format, we can start visualizing the data. Let's start by plotting the number of active cases, total number of confirmed cases, number of deaths, and number of people cured.
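A sketch of the plotting step, with the four series drawn in a 2x2 grid. The column names here are assumed from the description above, not taken from the actual dataset:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Small illustrative frame; the real dataframe has one row per day.
df = pd.DataFrame(
    {
        "active": [100, 150, 210],
        "confirmed": [1000, 1200, 1500],
        "deaths": [10, 12, 15],
        "cured": [890, 1038, 1275],
    },
    index=pd.date_range("2021-04-01", periods=3, freq="D"),
)

# One subplot per metric.
fig, axes = plt.subplots(2, 2, figsize=(10, 6))
for ax, column in zip(axes.ravel(), df.columns):
    ax.plot(df.index, df[column])
    ax.set_title(column)
fig.tight_layout()
```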

We can very clearly see from these plots that cases and deaths have been skyrocketing starting around late March to early April 2021. This is when a new strain of the virus came to India. But what made this new strain so difficult to handle compared to any previous ones, and is the situation different by state in India? Let us try to explore this question by looking at the COVID-19 data partitioned by state. This can be found in the Ministry of Health's data, which can be located in mohfw.json.

Part 2: Looking at the Data by State

It seems that this JSON file's format is slightly different from that of all_totals.json. We can see that the rows of the data are essentially a set of key-value pairings. Let's inspect the contents of the JSON further.

We now want to extract the relevant data from the dictionary that is needed to display on the maps. Right now, the data is structured as time series data. For the maps, however, we are not concerned with when cases or deaths occurred, but rather with the current totals for those statistics. Later, when we create a machine learning model for predicting the number of deaths, we will come back to the time series data.

Let's now create the reformatted dictionary and see what value is stored at a particular key, say, Kerala.
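A sketch of the extraction, assuming (since mohfw.json's exact layout is not reproduced here) that each state maps to a chronological list of records with hypothetical field names; only the latest record per state is kept:

```python
# Toy stand-in for the parsed mohfw.json data (field names assumed).
data2 = {
    "Kerala": [
        {"confirmed": 1100000, "cured": 1000000, "death": 4500},
        {"confirmed": 1200000, "cured": 1100000, "death": 5000},
    ],
    "Tamil Nadu": [
        {"confirmed": 1300000, "cured": 1200000, "death": 17000},
    ],
}

# Keep only the most recent record per state -- the maps need current
# totals, not the full time series.
extracted2 = {state: records[-1] for state, records in data2.items()}
print(extracted2["Kerala"])
```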

Now that we have the dictionary properly formatted, it's time to convert it into a pandas dataframe. Pandas allows us to feed in a dictionary into the dataframe constructor, so we can simply pass extracted2 into that. For more details on how this works, see the documentation at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html.
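The conversion can be sketched like this (state names and field names are illustrative); `orient="index"` makes each state a row and each statistic a column:

```python
import pandas as pd

# extracted2 maps each state to its current totals (field names assumed).
extracted2 = {
    "Kerala": {"confirmed": 1200000, "cured": 1100000, "death": 5000},
    "Tamil Nadu": {"confirmed": 1300000, "cured": 1200000, "death": 17000},
}

state_df = pd.DataFrame.from_dict(extracted2, orient="index")
print(state_df)
```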

We can see there are some missing values (a lot actually!) that are labelled as 'unassigned.' For now we will drop these values from the table, although further analysis might be required to truly understand the effect of these values on the data overall.
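Dropping the placeholder rows is a simple boolean filter (the toy frame below assumes the placeholders appear as a literal 'unassigned' label, as described above):

```python
import pandas as pd

# Toy frame containing an 'unassigned' placeholder row like the real table.
state_df = pd.DataFrame(
    {"state": ["Kerala", "unassigned", "Tamil Nadu"],
     "death": [5000.0, None, 17000.0]}
)

# Drop the placeholder rows; a fuller analysis would first examine what
# share of the totals they represent.
state_df = state_df[state_df["state"] != "unassigned"].reset_index(drop=True)
print(state_df)
```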

Now we can proceed by trying to visualize the data on a map. The idea is to create three maps where we color states by their relative number of cases/cured/deaths. The first step is to scale down the data by a factor of a million. This will make the scales in our maps more readable.

We will use the folium library, which is a wrapper around the leaflet.js library, to display the maps. Note that these types of visualizations are called choropleths.

Choropleths Tutorial: https://vverde.github.io/blob/interactivechoropleth.html

Choropleths Documentation: https://python-visualization.github.io/folium/quickstart.html#Choropleth-maps

From these plots, we can see that things seem pretty dire in Maharashtra, the state where Mumbai is located. Maharashtra has the greatest number of deaths, cases, and cured people. This makes sense because Mumbai, India's largest city, is located in Maharashtra, so the state contains some highly dense urban areas. In a city, especially a dense one, it is more difficult to socially distance, making it easier for COVID-19 to spread.

Looking at the other states, it appears that states in Southern India have higher numbers of cases, deaths, and cured people than states in Northern India. Kerala, in Southern India, is an interesting case study: while it has similar numbers of cases and cured people, it has significantly fewer deaths than neighboring states such as Tamil Nadu and Karnataka.

Machine Learning

We can now create a model for India's COVID-19 cases to predict how many deaths there will be in the next 3 months. We can do this by accessing the data from data2 that we previously had.

For a tutorial on how to deal with time series data for machine learning, see https://www.pluralsight.com/guides/machine-learning-for-time-series-data-in-python

Here I am adding several new variables to the data based on the datetimes originally in the data. I am doing this because it is possible that there are correlations between the number of COVID-19 related deaths and these new variables. For example, it might be the case that there are more people outside, and thus exposed to COVID-19, on weekends compared to weekdays, so this would be an important variable to keep track of.
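The feature-engineering step can be sketched as follows (the `death` column and the exact set of derived variables are illustrative; `dayofweek >= 5` marks Saturday and Sunday):

```python
import pandas as pd

# Toy daily series standing in for the real time series data.
df = pd.DataFrame(
    {"death": [100, 120, 150]},
    index=pd.to_datetime(["2021-04-02", "2021-04-03", "2021-04-04"]),
)

# Derive calendar features from the datetime index.
df["year"] = df.index.year
df["month"] = df.index.month
df["day"] = df.index.day
df["day_of_week"] = df.index.dayofweek   # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"] >= 5
print(df)
```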

Looks like the seconds column has a lot of zeroes ... let's check for sure whether it is all zeroes. If it is, we can safely remove that piece of data.

Now removing seconds from the dataframe, since all 496 values are 0 ...
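The check-then-drop step looks like this (toy frame; the guard ensures we only discard the column once every entry really is zero):

```python
import pandas as pd

df = pd.DataFrame({"death": [100, 120], "seconds": [0, 0]})

# Drop the seconds column only if it carries no information at all.
if (df["seconds"] == 0).all():
    df = df.drop(columns="seconds")
print(df.columns.tolist())
```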

Although year and month are stored as numbers in the dataset, they are really categorical variables, because numerical operations on them (such as adding two years) do not make sense. For this reason, we will encode them as indicator (dummy) variables, creating one extra column for each distinct year and month.

Learn more about the get_dummies function from the documentation: https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
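A small example of the encoding: `get_dummies` replaces each listed column with one indicator column per distinct value.

```python
import pandas as pd

df = pd.DataFrame({"year": [2020, 2021, 2021], "month": [12, 1, 2]})

# year/month become year_2020, year_2021, month_1, month_2, month_12.
dummies = pd.get_dummies(df, columns=["year", "month"])
print(dummies.columns.tolist())
```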

Now that we have created the necessary variables, it's time to start machine learning. I will start by dividing the data into training and test sets by an 80-20 split.
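The split can be sketched with scikit-learn (the feature matrix below is synthetic; `test_size=0.2` gives the 80-20 split):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and target standing in for the engineered data.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# 80-20 split; random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```

Note that for time series data a chronological split (e.g. holding out the last 20% of dates) is often preferable to a shuffled one, since it better mimics predicting the future.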

Now that we have the data split appropriately, it's time to create a model. We will use a Decision Tree as our first basic model to see how well it performs.

Here we create the Decision Tree model and evaluate its performance using the root mean squared error (RMSE) and the R² coefficient of determination.
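A sketch of the fit-and-evaluate loop on synthetic data (the real notebook fits on the engineered features and the death counts):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic near-linear regression problem standing in for the real data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
predictions = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
r2 = r2_score(y_test, predictions)
print(rmse, r2)
```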

It seems that the RMSEs of both training and testing are relatively low compared to the numbers in the dataset (which were in the ten millions), and the training and testing R² scores are pretty good! But let's see if we can do even better with a different model.

Here we are using a random forest model and evaluating its performance in the same way.
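The same setup with a random forest, which averages many decision trees trained on bootstrap samples and therefore usually generalizes better than a single tree:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Same synthetic regression setup as before; only the model changes.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
predictions = forest.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
r2 = r2_score(y_test, predictions)
print(rmse, r2)
```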

We can see that the random forest does phenomenally well, even better than the Decision Tree: its RMSEs are somewhat lower, and its training and testing R² scores are much better. Let us now look at the number of deaths predicted for India overall.

We will start by generating the dates required for the next 3 months.
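Generating the future dates is a one-liner with `pd.date_range`; the anchor date below is an assumption standing in for the last date actually present in the data:

```python
import pandas as pd

last_date = pd.Timestamp("2021-05-31")  # assumed last date in the data
# 90 consecutive days starting the day after the data ends (~3 months).
future = pd.date_range(last_date + pd.Timedelta(days=1), periods=90, freq="D")
print(future[0], future[-1])
```

These dates would then be run through the same feature engineering (year, month, day-of-week dummies) before being fed to the fitted model.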

Conclusion